Footballer Transfer Fee Prediction Project

Football is a game that is simple to understand but difficult to analyze statistically, and its statistics are hard to interpret in a simple way. In a sport so full of unknowns, the transfer market value of players is important. Knowing how a player's market value changes with the player's statistics is something everyone in the football industry wants. This project aims to build a transfer fee prediction model based on the player features given in the dataset. In addition, several statistical and visual findings are presented.

Installing necessary libraries

About the dataset

The dataset is available at: https://www.kaggle.com/datasets/khanghunhnguyntrng/football-players-transfer-fee-prediction-dataset. It has 22 columns and 10754 rows and covers football players from 20 top leagues around the world, with their statistics from the last 2 seasons. The columns are explained step by step in the tasks below. To improve the evaluation of the prediction model, the statistics "goals", "assists", "yellow cards", "second yellow cards", "red cards", "goals conceded", and "clean sheets" were converted to a per-90-minutes basis. Each value was divided by the player's number of 90-minute units played (minutes played divided by 90).
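The per-90 conversion described above can be sketched as follows. This is a minimal illustration, not the dataset author's code: the rows are invented toy values, and the column names are assumed to match the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these values are invented, not taken from the dataset.
raw = pd.DataFrame({
    "goals": [10, 3],
    "assists": [5, 0],
    "minutes played": [1800, 450],
})

# Number of 90-minute units each player has played.
nineties = raw["minutes played"] / 90

# Divide each raw count by the number of 90-minute units, row by row.
per_90 = raw[["goals", "assists"]].div(nineties, axis=0)
print(per_90)
```

A player with 10 goals in 1800 minutes (20 units of 90 minutes) ends up with 0.5 goals per 90 minutes.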

Reading the CSV dataset.

For easier calculation and interpretation, we multiply the values that are given per appearance by the number of appearances.
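A minimal sketch of this step, assuming the appearance count lives in a column named "appearance" and that "goals" and "assists" are among the rate columns. The toy values are invented for illustration.

```python
import pandas as pd

# Toy frame for illustration only; values are invented.
df = pd.DataFrame({
    "appearance": [20, 10],
    "goals": [0.5, 0.2],      # assumed to be per-appearance rates
    "assists": [0.25, 0.0],
})

# Hypothetical list of rate columns to convert back to totals.
rate_columns = ["goals", "assists"]

# Multiply each rate by the player's number of appearances, row by row.
df[rate_columns] = df[rate_columns].mul(df["appearance"], axis=0)
print(df)
```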

Printing the first 5 rows. After that, we check the column names and their data types.

The dataset does not contain null values.
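The inspection steps above might look like the sketch below. It uses a small invented frame in place of the real CSV so that it runs on its own. With the real data, `df` would come from `pd.read_csv` on the downloaded file.

```python
import pandas as pd

# Stand-in for the loaded dataset; values are invented for illustration.
df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "height": [185.0, 190.0, 178.0],
    "goals": [0, 1, 2],
})

print(df.head())    # first rows (up to 5)
print(df.dtypes)    # column names and data types

# Count missing values per column and check whether any column has one.
null_counts = df.isnull().sum()
has_nulls = bool(null_counts.any())
print(null_counts)
```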

Statistical distribution of each numeric column.

We can see the relationships between columns by using a pairplot. The pairplot columns are the statistics that might have a direct influence on a match score. A random sample of 500 rows is used.

In addition, we plot the correlation matrix to see the correlation between each pair of columns. Choosing the right parameters matters for how useful our statistical and visual findings are. Warm colors represent strong correlations, while cool colors represent weak correlations.
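The sampling and correlation steps can be sketched with pandas alone. The frame below is invented toy data, and the plotting calls are left as comments because the exact plot parameters used in the project are not stated here.

```python
import pandas as pd

# Toy frame for illustration only; values are invented.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "goals": [1, 2, 3, 4],
    "assists": [2, 4, 6, 8],
    "yellow cards": [4, 3, 2, 1],
})

# Random sample for the pairplot (capped at the frame size for this toy example).
sample = df.sample(n=min(500, len(df)), random_state=0)

# Pairwise Pearson correlation over the numeric columns only.
corr = df.select_dtypes(include="number").corr()
print(corr)

# Assumed plotting calls (seaborn), shown for orientation only:
#   sns.pairplot(sample[["goals", "assists", "yellow cards"]])
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```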

Can we differentiate goalkeeper performance with linear regression?

If a goalkeeper finishes a match without conceding a goal, the match counts as a clean sheet for that goalkeeper, so clean sheets are crucial for goalkeepers. In this task, we apply linear regression to goalkeepers and their performance statistics, aiming to find the goalkeepers who performed better over the 2 seasons. We choose the players whose position is goalkeeper and select the columns we are going to use.

For a more reliable result, we do not consider goalkeepers who played fewer than 20 matches in the 2 seasons, so we filter them out. Plotly Express is used for an interactive scatterplot, so we can interactively inspect each goalkeeper's information.

Based on the linear regression results, goalkeepers above the line performed better over the 2 seasons than goalkeepers below it. Performance increases the further a goalkeeper lies above the line and decreases the further below.
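As a rough sketch of the above-the-line and below-the-line reading, the example below fits a straight line and looks at the sign of each residual. The choice of axes (goals conceded on x, clean sheets on y) is an assumption, since the exact columns are not stated here, and the goalkeepers and values are invented.

```python
import numpy as np
import pandas as pd

# Toy goalkeepers for illustration only; names and values are invented.
gk = pd.DataFrame({
    "name": ["GK A", "GK B", "GK C", "GK D", "GK E"],
    "appearance": [60, 45, 25, 12, 30],
    "goals conceded": [50, 40, 30, 15, 45],
    "clean sheets": [25, 15, 8, 4, 5],
})

# Drop goalkeepers with fewer than 20 matches.
gk = gk[gk["appearance"] >= 20].copy()

# Least-squares line: clean sheets as a function of goals conceded (assumed axes).
slope, intercept = np.polyfit(gk["goals conceded"], gk["clean sheets"], deg=1)

# A positive residual means the goalkeeper sits above the regression line.
gk["residual"] = gk["clean sheets"] - (slope * gk["goals conceded"] + intercept)
above_line = gk.loc[gk["residual"] > 0, "name"].tolist()
print(above_line)
```

Sorting by the residual column would give a simple ranking of how far each goalkeeper lies above or below the line.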

Does goalkeeper height affect their clean sheet statistics?

There are stereotypes about goalkeeper height: it is thought that the taller the goalkeeper, the more successful he will be. In this task, we test this claim with our goalkeeper dataset.

According to the boxplot, goalkeepers taller than 184 cm are generally more successful. However, being taller than the average goalkeeper height does not give a further advantage, and beyond a certain height there is no significant difference between heights.
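One way to put numbers behind the boxplot is to bin goalkeepers by height and compare the median clean sheets per bin. The bin edges other than 184 cm, the labels, and the toy values below are all assumptions made for illustration.

```python
import pandas as pd

# Toy goalkeepers for illustration only; values are invented.
gk = pd.DataFrame({
    "height": [180, 183, 186, 188, 192, 195],
    "clean sheets": [4, 6, 10, 12, 11, 11],
})

# Hypothetical height bins; intervals are closed on the right, e.g. (184, 190].
bins = [0, 184, 190, 250]
labels = ["<=184 cm", "185-190 cm", ">190 cm"]
gk["height_group"] = pd.cut(gk["height"], bins=bins, labels=labels)

# Median clean sheets per height group.
summary = gk.groupby("height_group", observed=True)["clean sheets"].median()
print(summary)
```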

Which teams are the most aggressive in the top 20 leagues?

In this task, we want to find the top 20 teams by card statistics, that is, the most aggressive teams over the 2 seasons. We filter the dataset to teams that have at least 1 record for each of the card features.

Then we group by team and select the 20 teams with the most yellow cards. The yellow card is chosen because it is the most common card in a football match.
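A sketch of the filter, group, and rank steps follows. It reads "at least 1 record for each of the features" as "a team total above zero for every card type", which is one possible interpretation. The teams and values are invented.

```python
import pandas as pd

# Toy player rows for illustration only; values are invented.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "yellow cards": [5, 3, 10, 2, 7],
    "second yellow cards": [1, 0, 0, 0, 1],
    "red cards": [0, 1, 0, 0, 1],
})

card_columns = ["yellow cards", "second yellow cards", "red cards"]

# Total cards per team.
team_cards = df.groupby("team")[card_columns].sum()

# Keep teams with a non-zero total for every card type (assumed interpretation).
team_cards = team_cards[(team_cards > 0).all(axis=1)]

# Top 20 teams by yellow cards, in descending order.
top_teams = team_cards.nlargest(20, "yellow cards")
print(top_teams)
```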

Can we classify players based on the percentage they play?

In this task, we want to classify players according to the rate of minutes they played. Important players play regularly for their teams and are rarely substituted off during a season.

We use K-means clustering to group players by their played rate. In this way, we can see the importance of players to their teams interactively. A random sample of 1500 players is used for the classification and visualization.
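The clustering step might be sketched as below. The definition of played rate (minutes played divided by the minutes available across appearances) and the choice of 3 clusters are assumptions, as neither is stated here. The player rows are invented.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy players for illustration only; values are invented.
players = pd.DataFrame({
    "appearance": [30, 28, 32, 30, 25, 20],
    "minutes played": [2700, 2500, 1400, 1500, 300, 200],
})

# Assumed definition: share of the available minutes actually played.
players["played_rate"] = players["minutes played"] / (players["appearance"] * 90)

# Group players into 3 clusters by played rate (cluster count is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
players["cluster"] = kmeans.fit_predict(players[["played_rate"]])
print(players)
```

The cluster labels themselves are arbitrary integers, so only the grouping is meaningful, not the label order.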

Create an interactive scatterplot to better visualize the K-means clusters.

Who are the players who contributed to the most goals with the least minutes played?

Goal contribution is not only scoring goals. It also includes assists, since an assist is the pass that leads to a goal. We therefore aim to find the players with the highest goal contribution in the fewest minutes played. To exclude players with very few goal contributions and appearances, we choose players with at least 2 goals and 2 assists. In addition, a player must have played at least 20 full matches, which is equal to 1800 minutes.
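The thresholds above translate directly into a filter. The ranking metric used below (minutes played per goal contribution, lower is better) is an assumption, since the text does not say how minutes and contributions are combined. The player rows are invented.

```python
import pandas as pd

# Toy players for illustration only; values are invented.
df = pd.DataFrame({
    "name": ["Player 1", "Player 2", "Player 3", "Player 4"],
    "goals": [10, 2, 15, 1],
    "assists": [5, 2, 3, 9],
    "minutes played": [1800, 1900, 3600, 2000],
})

# At least 2 goals, 2 assists, and 1800 minutes (20 full matches).
eligible = df[
    (df["goals"] >= 2) & (df["assists"] >= 2) & (df["minutes played"] >= 1800)
].copy()

# Goal contribution is goals plus assists.
eligible["goal_contribution"] = eligible["goals"] + eligible["assists"]

# Assumed metric: minutes needed per goal contribution.
eligible["minutes_per_contribution"] = (
    eligible["minutes played"] / eligible["goal_contribution"]
)

top_players = eligible.nsmallest(20, "minutes_per_contribution")
print(top_players[["name", "goal_contribution", "minutes_per_contribution"]])
```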

Visualizing the top 20 players with a barplot. These players give their teams important bench strength and are very likely to come on as substitutes at critical moments. Some of them are also young and have high potential.

Are the awards won related to a high market value?

Football is a team sport, and awards are won as a team. We want to determine whether players with many awards have reached the highest market values.

Some players have both top transfer market values and many awards. However, since football is a team game, there are also players with a low market value who have still won a considerable number of awards. Players plotted above the regression line are successful in terms of awards won.

Does a player's position matter for game injuries?

There are several positions in football, and each requires different skills and attributes. We would like to know whether game injuries vary with a player's position. A violin plot with quasirandom jitter is used for visualization.

We can see that players in the center of the field have more injuries than players on the wings. The number of injuries also increases as we move from the striker area toward the goalkeeper area. A likely reason is that players in the center are exposed to much more physical contact. Goalkeepers, however, are among the positions that suffer the fewest injuries. The 3 data points on the right of the chart can be ignored.

Selecting features and model prediction

First, we use LabelEncoder to convert categorical features into numerical values. The encoded values are stored in new columns named f_encoded, where f is the name of the feature.

2 new columns are added to the dataframe.
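The encoding step can be sketched with scikit-learn's LabelEncoder. Which two features were encoded is not stated here, so "team" and "position" below are assumptions, and the rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame for illustration only; values are invented.
df = pd.DataFrame({
    "team": ["Team B", "Team A", "Team B"],
    "position": ["Goalkeeper", "Attack", "Midfield"],
})

# Assumed categorical features; each gets a new "<feature>_encoded" column.
for feature in ["team", "position"]:
    encoder = LabelEncoder()
    df[f"{feature}_encoded"] = encoder.fit_transform(df[feature])

print(df)
```

LabelEncoder assigns integers in sorted order of the class labels, so the numbers carry no ranking meaning of their own.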

Features are selected based on the results of the previous tasks in this project and on the parameters with the most direct impact on team or individual success in football matches.

StandardScaler is used to standardize the features, which makes them more comparable.
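As a small illustration of what standardization does, the sketch below scales two invented feature columns with very different ranges so that each ends up with mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix for illustration only: e.g. height (cm) and minutes played.
X = np.array([
    [180.0, 1000.0],
    [190.0, 3000.0],
    [185.0, 2000.0],
])

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```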

Gradient Boosting Regressor is used because its accuracy was higher than that of the other learning algorithms. The structure of the dataset is also well suited to Gradient Boosting Regressor.
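A minimal end-to-end sketch of the modeling step is shown below on synthetic data, so the score it prints says nothing about the project's actual result. The 80/20 split, the default hyperparameters, and the use of R² as the metric are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features and the market value target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=300)

# Hold out 20% of the rows for evaluation (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
model = GradientBoostingRegressor(random_state=0)
model.fit(scaler.transform(X_train), y_train)

# R^2 on the held-out rows (assumed metric).
score = r2_score(y_test, model.predict(scaler.transform(X_test)))
print(f"R^2 on held-out data: {score:.3f}")
```

Fitting the scaler on the training split only keeps information from the test rows out of the preprocessing step.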

In this project, we analyzed the relationships between complex football statistics both statistically and visually. The main purpose of the project is a transfer market value prediction model for football players. Based on the tasks carried out with the dataset, the resulting prediction model works with 81% efficiency. There are, of course, far more complex statistics in football that are not included in this dataset, and these also affect the market value of players. With a dataset that includes such statistics, prediction models could likely reach higher accuracy. Beyond this, various data science methods, visualizations, and statistics were used to answer several research questions, producing findings at the player level, the team level, and for some specific statistics.